Lending Club 2007-2020Q3

Consolidated data across all years

Dataset: https://www.kaggle.com/ethon0426/lending-club-20072020q1

Objectives:

  1. Build predictive models for the default rate, e.g., boosted trees (XGBoost, LightGBM), a neural network, and a generalized additive model (GAM).
  2. Explain models using Shapley values.

Reference:

  1. https://shap.readthedocs.io/en/latest/index.html
  2. https://christophm.github.io/interpretable-ml-book/shapley.html

Download data and data description

Download data

Import library and functions

Data description

Load data

Select features

Data preprocessing

Target variable: loan_status

int_rate

term

addr_state

emp_title

emp_length

home_ownership

annual_inc

dti

earliest_cr_line

earliest_cr_line: The month the borrower's earliest reported credit line was opened

It is transformed into the time elapsed (in months) from earliest_cr_line to issue_d.
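A minimal sketch of this transformation with pandas (toy values; in the dataset both columns are stored as month strings such as "Aug-2005"):

```python
import pandas as pd

# Toy rows; real values follow the same "%b-%Y" month-string format.
df = pd.DataFrame({
    "earliest_cr_line": ["Aug-2005", "Jan-2010"],
    "issue_d": ["Dec-2015", "Mar-2018"],
})

ecl = pd.to_datetime(df["earliest_cr_line"], format="%b-%Y")
issue = pd.to_datetime(df["issue_d"], format="%b-%Y")

# Elapsed time in whole months from earliest credit line to loan issue date.
df["earliest_cr_line"] = (issue.dt.year - ecl.dt.year) * 12 + (issue.dt.month - ecl.dt.month)
```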

fico_range

revol_bal

revol_util

bc_util

mort_acc

Missing values in mort_acc are filled with the group mean of mort_acc within each total_acc group. The assumption is that borrowers with the same total_acc are likely to have a similar proportion of mortgage accounts.
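A sketch of this group-mean imputation in pandas (toy values; the column names match the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_acc": [10, 10, 10, 25, 25],
    "mort_acc": [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Fill each missing mort_acc with the mean mort_acc of rows that share
# the same total_acc value.
group_mean = df.groupby("total_acc")["mort_acc"].transform("mean")
df["mort_acc"] = df["mort_acc"].fillna(group_mean)
```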

pub_rec_bankruptcies

delinq_2yrs

inq_last_6mths

inq_last_12m

mths_since_last_delinq

platform-dependent grade (rating)

initial_list_status

mo_sin_old_il_acct

There is no obvious difference between the distributions for Charged Off and Fully Paid loans.

mo_sin_old_rev_tl_op

mths_since_last_major_derog

mths_since_rcnt_il

mths_since_recent_bc_dlq

mths_since_recent_revol_delinq

drop some columns

Check NA values

Correlation

Train/test data

Encoding categorical features
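One plausible encoding step, sketched with pd.get_dummies (the values here are simplified stand-ins; term is assumed already cleaned to "36"/"60", which would yield the term_36 dummy referenced in the SHAP analysis below):

```python
import pandas as pd

df = pd.DataFrame({
    "term": ["36", "60", "36"],  # loan term in months, text already stripped
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# One-hot encode the categorical columns; each level becomes a 0/1 column.
encoded = pd.get_dummies(df, columns=["term", "home_ownership"])
```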

Train/test split

Scale data

Analyze class imbalance in the targets
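A quick way to quantify the imbalance (toy labels standing in for the encoded loan_status). The negative-to-positive ratio is also what the XGBoost docs suggest for the scale_pos_weight parameter:

```python
import pandas as pd

# Toy target: 0 = Fully Paid, 1 = Charged Off.
y = pd.Series([0] * 80 + [1] * 20)

counts = y.value_counts()

# XGBoost guideline: scale_pos_weight = sum(negative) / sum(positive).
scale_pos_weight = counts[0] / counts[1]
```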

Create a train-test dataset for tuning hyperparameters

Bayesian Optimization for hyperparameters

https://github.com/fmfn/BayesianOptimization

https://www.kaggle.com/clair14/tutorial-bayesian-optimization

Logistic Regression

Model Training
Evaluation

XGBoost

Model Training
Evaluation
Feature Importance
SHAP Values

reference: https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

Explanation of base value (explainer_expected_value)

Base value: the mean of the model output over the background dataset. In other words, it is the output that would be predicted if no feature values were known. (It can also be viewed as a prior output.)

In practice, it can be computed based on

  1. the bias term of the model, i.e., the final column of the output from xgbm.predict(dsample, pred_contribs=True);
  2. the mean of logit(p) in the train dataset, i.e., $\dfrac{1}{N} \sum_{i=1}^{N}logit(\hat{p}_{i})$, where $\hat{p}_{i}$ is the predicted probability of the $i$-th instance in the train dataset and $logit(p) = \log (\dfrac{p}{1-p}) = \beta_{0}+\beta_{1}f_{1}(x_{1})+\cdots$.

reference: Page 5 on https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

Explain feature importance for an instance in train data

Explain feature importance for an instance in test data

Explain feature importance for a large number of instances

  1. int_rate is the most important feature, with the largest mean absolute SHAP value over the selected samples.
  2. A low int_rate mostly decreases the default probability, while a large portion of high int_rate values contribute to increasing it.
  3. term_36 has mostly negative SHAP values, i.e., it decreases the default probability.

Dependence plot

SHAP Interaction Values https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/NHANES%20I%20Survival%20Model.html?highlight=dependence_plot#Compute-SHAP-Interaction-Values

https://xgboost.readthedocs.io/en/latest/gpu/index.html

https://christophm.github.io/interpretable-ml-book/shap.html

LightGBM

Model Training
Evaluation
Feature Importance
SHAP Values

Explain feature importance for an instance in train data

Explain feature importance for an instance in test data

Explain feature importance for a large number of instances

Dependence plot

SHAP Interaction Values https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/NHANES%20I%20Survival%20Model.html?highlight=dependence_plot#Compute-SHAP-Interaction-Values

Neural Network

Model Training
Evaluation
SHAP Values

Explain feature importance for an instance in train data

Explain feature importance for an instance in test data

GAM

Model Training

Model Performance Comparison

Load models

Evaluation

On train data

On test data

Evaluation Summary